
feat: generate GeoJSON from PostgreSQL instead of re-reading CSV #404

Open
ThibaudDauce wants to merge 13 commits into main from geojson-from-db

Conversation

Contributor

@ThibaudDauce ThibaudDauce commented Apr 2, 2026

Stream GeoJSON features directly from the database using a cursor, avoiding a third read of the CSV file. PostgreSQL builds the JSON with json_build_object, so no Python-level casting is needed.

Also extract _detect_geo_columns to deduplicate the geo column detection logic, and add detailed timer marks for the geojson and pmtiles conversion steps.

This will allow sending this part to another job.

On MN_07_latest-2025-2026.csv: 50s → 18s (-64%)
On joconde.csv: 75s → 43s (-42%)

Same as #402 but for GeoJSON
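
Roughly, the new path looks like this (a minimal sketch with hypothetical table and column names, assuming asyncpg; the real query is built dynamically from the inspection):

import asyncpg

async def stream_features(conn: asyncpg.Connection, out_path: str) -> None:
    # PostgreSQL builds each Feature with json_build_object; rows are
    # streamed through a server-side cursor and written out one by one.
    query = """
        SELECT json_build_object(
            'type', 'Feature',
            'geometry', json_build_object(
                'type', 'Point',
                'coordinates', json_build_array(lon, lat)
            ),
            'properties', json_build_object('name', name)
        )::text AS feature_json
        FROM my_table
    """
    with open(out_path, "w") as f:
        f.write('{"type": "FeatureCollection", "features": [\n')
        first = True
        async with conn.transaction():  # asyncpg cursors require a transaction
            async for row in conn.cursor(query):
                if not first:
                    f.write(",\n")
                f.write(row["feature_json"])
                first = False
        f.write("\n]}\n")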

@ThibaudDauce ThibaudDauce requested review from Pierlou, bolinocroustibat and maudetes and removed request for Pierlou and bolinocroustibat April 2, 2026 11:39
@ThibaudDauce ThibaudDauce marked this pull request as ready for review April 2, 2026 11:39
@ThibaudDauce ThibaudDauce marked this pull request as draft April 2, 2026 13:03
Contributor Author

Still a bug when there are a lot of columns, will push a fix next week!

Contributor

bolinocroustibat commented Apr 3, 2026

I might wait for that PR to be finished so I can base this new PR on it, to separate csv_to_geojson and geojson_to_parquet as independent tasks in the RQ queue (see the sketch below). Those tasks need to make sure the input data is not temporary, so parquet in DB would work in that direction.
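
A sketch of that split, assuming RQ's depends_on chaining; queue setup, placeholder values, and task imports are hypothetical, task names as written above:

from redis import Redis
from rq import Queue

# imports of the task functions omitted; names as written in this comment
resource_id = "xxxx"  # placeholder
queue = Queue("conversions", connection=Redis())

# Enqueue the two conversions as independent jobs; the second runs only
# once the first has succeeded, so its input must not be temporary.
geojson_job = queue.enqueue(csv_to_geojson, resource_id)
queue.enqueue(geojson_to_parquet, resource_id, depends_on=geojson_job)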

cc @ThibaudDauce @maudetes

@ThibaudDauce ThibaudDauce marked this pull request as ready for review April 6, 2026 11:15
@maudetes maudetes requested review from Pierlou and maudetes and removed request for Pierlou, bolinocroustibat and maudetes April 14, 2026 09:56
Contributor

@Pierlou Pierlou left a comment

Thanks for the refactor 🙏 The flow looks good to me (though I haven't gone through the technicalities of the SQL part), but I have a doubt regarding the type of the properties we export, and a couple of remarks.

Comment thread udata_hydra/analysis/geojson.py Outdated
return query, params


async def csv_to_geojson_from_db(
Contributor

We may want to keep the function name close to its parquet counterpart

Suggested change
async def csv_to_geojson_from_db(
async def save_as_geojson_from_db(

Contributor

... wouldn't db_to_geojson and db_to_parquet be even more explicit, and more consistent with namings like csv_to_geojson?

Contributor Author

Done in df12267

# ending up here means we either have the exact lat,lon format, or NaN
# skipping row if NaN
if row[geo["latlon"]] is None:
elif "latlon" in geo or "lonlat" in geo:
Contributor

Keeping the insight

Suggested change
elif "latlon" in geo or "lonlat" in geo:
elif "latlon" in geo or "lonlat" in geo:
# skipping row if geo data is None

Contributor Author

Done in df12267

Comment thread udata_hydra/analysis/geojson.py Outdated
Comment on lines +33 to +36

# latlon/lonlat columns can contain values like "[48.8566, 2.3522]" or "48.8566 , 2.3522"
# Both versions below strip spaces and brackets, then split on comma.

Contributor

The insight is already given within the function, this feels misplaced here

Suggested change
# latlon/lonlat columns can contain values like "[48.8566, 2.3522]" or "48.8566 , 2.3522"
# Both versions below strip spaces and brackets, then split on comma.

Contributor Author

Done in df12267
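
(For reference, the parsing those comments describe boils down to something like this; values illustrative:)

# Strip spaces and brackets, then split on the comma
raw = "[48.8566, 2.3522]"
lat, lon = (float(part) for part in raw.strip(" []").split(","))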

Comment thread udata_hydra/analysis/geojson.py Outdated
Comment on lines +283 to +286
for col in property_cols:
params.append(col)
placeholder = f"${len(params)}::text"
properties_fragments.append(f"{placeholder}, {_quote_ident(_db_col_name(col))}")
Contributor

NIT

Suggested change
for col in property_cols:
params.append(col)
placeholder = f"${len(params)}::text"
properties_fragments.append(f"{placeholder}, {_quote_ident(_db_col_name(col))}")
for idx, col in enumerate(property_cols):
params.append(col)
properties_fragments.append(f"${idx + 1}::text, {_quote_ident(_db_col_name(col))}")

Contributor Author

Done in df12267

Comment thread udata_hydra/analysis/geojson.py Outdated
properties_fragments = []
for col in property_cols:
params.append(col)
placeholder = f"${len(params)}::text"
Contributor

Does that mean we're exporting all values as text? In which case it is not what we want (columns that contain ints/floats should be exported as such)

Contributor Author

No, here we're talking about the columns/keys; I've added a comment in df12267
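
(To illustrate: the fragment below mirrors the code under review; only the key parameter is cast, the example value in the comments is illustrative.)

params.append(col)                     # e.g. col = "price"
placeholder = f"${len(params)}::text"  # casts the *key* passed to json_build_object
properties_fragments.append(f"{placeholder}, {_quote_ident(_db_col_name(col))}")
# → json_build_object('price', "price") keeps the column's own type,
#   so a numeric column is emitted as {"price": 12.5}, not {"price": "12.5"}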

Comment on lines +362 to +367
first = True
async for row in cursor:
if not first:
f.write(",\n")
f.write(row[0])
first = False
Contributor

I don't think I understand the first trick: why don't we f.write(f"{row[0]},\n") for all rows? (in which case we should finally f.write("]"))

Contributor Author

That would leave a trailing comma before the closing ], which is invalid JSON?

Contributor

@bolinocroustibat bolinocroustibat left a comment

Thank you. A few remarks, otherwise LGTM.

with output_file_path.open("w") as f:
f.write('{"type": "FeatureCollection", "features": [\n')
first = True
async for row in cursor:
Contributor

What happens in case of an error while parsing the rows?
Should we delete output_file_path in a try/finally if we detect a failure after a partial write, or wrap the stream loop in try/finally to remove the output file on any error after opening?

Contributor Author

The caller csv_to_geojson_and_pmtiles in csv.py:193-194 already wraps the call in a try/except that calls remove_remainders(resource_id, ["geojson", "pmtiles", "pmtiles-journal"]). So if an error occurs during streaming, the partial file is cleaned up by the caller.

This is the same pattern as the CSV path — csv_to_geojson doesn't handle cleanup on error either, it's delegated to the caller.
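
(For reference, the caller-side pattern is roughly this; a simplified fragment, signatures hypothetical beyond what is stated above:)

try:
    await db_to_geojson(resource_id, inspection, table_name, geojson_path)
    await geojson_to_pmtiles(geojson_path, pmtiles_path)
except Exception:
    # clean up any partial outputs left by a failed conversion
    remove_remainders(resource_id, ["geojson", "pmtiles", "pmtiles-journal"])
    raise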

upload_to_minio: bool = True,
) -> tuple[int, str | None] | None:
"""Generate a GeoJSON file by streaming features directly from PostgreSQL."""
geo = _detect_geo_columns(inspection)
Contributor

NIT: a type hint would be nice here for readability:

geo: dict | None = _detect_geo_columns(inspection)

Contributor Author

_detect_geo_columns already has a return type? Your IDE should show you the type hint on hover, no?

Contributor

My IDE doesn't :) I type hint when not obvious at first sight

Comment thread udata_hydra/analysis/geojson.py Outdated
output_file_path: Path,
upload_to_minio: bool = True,
) -> tuple[int, str | None] | None:
"""Generate a GeoJSON file by streaming features directly from PostgreSQL."""
Contributor

NIT: csv_to_geojson_from_db has a one-line docstring while csv_to_geojson documents args, behavior (skipped rows), and return values. Aligning the DB variant (even briefly: inputs, streaming, parity goal with the CSV path) might make it easier to maintain and choose between the two in case we have to.
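
For instance, something along these lines (wording hypothetical; parameters other than those shown in the diff elided):

from pathlib import Path

async def db_to_geojson(
    # ... earlier parameters elided ...
    output_file_path: Path,
    upload_to_minio: bool = True,
) -> tuple[int, str | None] | None:
    """Generate a GeoJSON file by streaming features directly from PostgreSQL.

    Mirrors csv_to_geojson: rows with missing geo data are skipped and the
    output structure matches the CSV path, so the two implementations stay
    interchangeable.
    """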

Contributor Author

Done in df12267

Comment thread udata_hydra/analysis/geojson.py Outdated
async for row in cursor:
if not first:
f.write(",\n")
f.write(row[0])
Contributor

NIT: Might be a good opportunity to add a column alias on the outer SELECT in _build_feature_sql, to avoid using row[0], preferring something like row["feature_json"]; that would massively improve readability imho.
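
(Sketch of the suggestion; the surrounding query shape is hypothetical:)

# In _build_feature_sql, alias the built JSON on the outer SELECT:
#   SELECT json_build_object(...)::text AS feature_json FROM my_table
# so the stream loop can use named access:
async for row in cursor:
    if not first:
        f.write(",\n")
    f.write(row["feature_json"])  # instead of row[0]
    first = False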

Contributor Author

Done in df12267

template["features"] = streamable_list(get_features(file_path, inspection, geo))

with output_file_path.open("w") as f:
json.dump(template, f, indent=4, ensure_ascii=False, default=str)
Contributor

GeoJSON files are often large, pretty-printing can add a meaningful amount of redundant bytes on disk and for uploads, even if compression narrows the gap.

Also, indent=4 produces pretty-printed JSON, while the database path streams compact output and does not re-indent the whole document the way indent=4 would. Someone diffing two exports or grepping a file might think something broke when they see the compact stream.

Should we treat compact as the normal export format for both paths, dropping indent=4 on the CSV path, and assume that if a human needs to read the file, they will format it themselves anyway?
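
(One way to do that on the CSV path, as a sketch:)

# Dropping indent (and tightening separators) makes json.dump emit compact
# JSON, matching the style of the streamed DB output:
json.dump(template, f, ensure_ascii=False, default=str, separators=(",", ":"))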

Contributor Author

Yes, but this path is only for csv_to_geojson (the line only moved); I think we can leave it since it will be removed soon?

Contributor

@bolinocroustibat bolinocroustibat left a comment

Better naming suggestion

EDIT: doesn't pass the linting

# Convert to GeoJSON — from DB if available and enabled, otherwise from CSV file
if config.DB_TO_GEOJSON and table_name:
result = await csv_to_geojson_from_db(
result = await save_as_geojson_from_db(
Contributor

I think we should use this opportunity to rename:
save_as_geojson_from_db -> db_to_geojson
save_as_parquet_from_db -> db_to_parquet
Those names match the source_to_target style we already use (csv_to_parquet, csv_to_db, geojson_to_pmtiles, parquet_to_db), and they also line up nicely with our config flags.

Optionally, to match that, we could also rename save_as_parquet to rows_to_parquet.
